Statistical Language Identification of Short Texts
Abstract
Although correctly identifying the language of short texts would be useful in many applications, few satisfactory attempts are reported in the literature. In this paper we describe a naive Bayes classifier that performs well on very short texts, as well as the corpus that we created from movie subtitles to train it. Both the corpus and the algorithm are available under the GNU Lesser General Public License.
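To make the idea concrete, here is a minimal sketch of naive Bayes language identification. It is not the authors' implementation: the abstract does not state which features or smoothing the classifier uses, so the character trigrams, the add-one (Laplace) smoothing, the class name `NaiveBayesLanguageId`, and the toy training sentences are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased, space-padded text."""
    padded = " " + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NaiveBayesLanguageId:
    """Hypothetical multinomial naive Bayes over character n-grams."""

    def __init__(self, n=3, alpha=1.0):
        self.n = n                                # n-gram length (assumed)
        self.alpha = alpha                        # additive smoothing (assumed)
        self.ngram_counts = defaultdict(Counter)  # language -> n-gram counts
        self.doc_counts = Counter()               # language -> number of samples
        self.vocab = set()                        # all n-grams seen in training

    def train(self, samples):
        """samples: iterable of (text, language_label) pairs."""
        for text, lang in samples:
            grams = char_ngrams(text, self.n)
            self.ngram_counts[lang].update(grams)
            self.doc_counts[lang] += 1
            self.vocab.update(grams)

    def predict(self, text):
        """Return the language with the highest log posterior for text."""
        grams = char_ngrams(text, self.n)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_lang, best_score = None, -math.inf
        for lang, counts in self.ngram_counts.items():
            total = sum(counts.values())
            # log prior plus the sum of smoothed log likelihoods
            score = math.log(self.doc_counts[lang] / total_docs)
            for gram in grams:
                score += math.log(
                    (counts[gram] + self.alpha)
                    / (total + self.alpha * vocab_size)
                )
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang


# Toy training data, invented for illustration (not the subtitle corpus).
samples = [
    ("where are you going tonight", "en"),
    ("thank you very much", "en"),
    ("what is the time", "en"),
    ("où est-ce que tu vas ce soir", "fr"),
    ("merci beaucoup mon ami", "fr"),
    ("quelle heure est-il", "fr"),
]

clf = NaiveBayesLanguageId()
clf.train(samples)
print(clf.predict("thank you"))
print(clf.predict("merci beaucoup"))
```

Summing log probabilities rather than multiplying raw probabilities avoids numerical underflow, and the smoothing term keeps an unseen trigram from driving a language's score to negative infinity, which matters when the input is only a few characters long.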
Related resources
The Impact of Input Enrichment in Long Text vs. Short Texts on Grammatical Accuracy in Writing Among Elementary Language Learners
This study was conducted to investigate the influence of teaching accurate grammar in writing via enriched long text and short text for elementary students at Shokouhe_Farhang institute. The homogenized subjects were divided into two groups of 18 and 17 participants. A writing exam was used as a pretest to check the students’ knowledge of the English past tense. The control group received the...
Language Identification Based on High Frequency Approaches
This paper deals with the problem of automatic language identification of noisy texts, which represents an important task in natural language processing. Actually, there exist several works in this field, which are based on statistical and machine learning approaches for different categories of texts. Unfortunately, most of the proposed methods work fine on clean texts or long texts, but often ...
Entity Recognition and Language Identification with FELTS
These working notes describe the experiments we conducted in the Microblog Cultural Contextualization Lab [2] of CLEF 2017 [3]. The microblog data is composed of very short texts, with very heterogeneous styles. Some of them are written in more than one language. We decided to tackle the entity recognition problem by using a non-statistical, dictionary-based, multiword term extractor. On the othe...
Graph-Based N-gram Language Identification on Short Texts
Language identification (LI) is an important task in natural language processing. Several machine learning approaches have been proposed for addressing this problem, but most of them assume relatively long and well-written texts. We propose a graph-based N-gram approach for LI called LIGA which targets relatively short and ill-written texts. The results of our experimental study show that LIGA ...
Adapting Statistical Language Identification Methods for Short Queries
This paper describes the participation of the UAIC team in the language identification task of the LogCLEF 2011 initiative. Our approach is an aggregation of known methods for recognizing languages. Short texts are a real challenge in applying a language identification tool, so our methods had to be robust to noisy data such as a single letter, numbers only, links, and various symbols. We app...
Publication date: 2011